Disclaimer: The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts.
This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) United States License.
Motivation
The following papers motivated this case study.
Twenge JM, Cooper AB, Joiner TE, Duffy ME, Binau SG. Age, period, and cohort trends in mood disorder indicators and suicide-related outcomes in a nationally representative dataset, 2005-2017. J Abnorm Psychol.128,3 (2019):185-199. doi:10.1037/abn0000410
Olfson, M., Blanco, C., Wang, S., Laje, G. & Correll, C. U. National Trends in the Mental Health Care of Children, Adolescents, and Adults by Office-Based Physicians. JAMA Psychiatry. 71, 81 (2014):81-90. doi: 10.1001/jamapsychiatry.2013.3074.
The main findings of the first article are:
Rates of major depressive episode in the last year increased 52% 2005–2017 (from 8.7% to 13.2%) among adolescents aged 12 to 17 and 63% 2009–2017 (from 8.1% to 13.2%) among young adults 18–25.
Serious psychological distress in the last month and suicide-related outcomes (suicidal ideation, plans, attempts, and deaths by suicide) in the last year also increased among young adults 18–25 from 2008–2017 (with a 71% increase in serious psychological distress), with less consistent and weaker increases among adults ages 26 and over.
Cultural trends contributing to an increase in mood disorders and suicidal thoughts and behaviors since the mid-2000s, including the rise of electronic communication and digital media and declines in sleep duration, may have had a larger impact on younger people, creating a cohort effect.
The main findings of the second article are:
Compared with adult mental health care, the mental health care of young people has increased more rapidly.
Between 1995-1998 and 2007-2010, visits resulting in mental disorder diagnoses per 100 population increased significantly faster for youths (from 7.78 to 15.30 visits) than for adults (from 23.23 to 28.48 visits) (interaction: P < .001).
Psychiatrist visits also increased significantly faster for youths (from 2.86 to 5.71 visits).
While depression appears to be on the rise for youths, youths also appear to be seeking more mental health care.
In this case study we will examine data related to depressive episodes and mental health care to evaluate trends over time. We will be using data from the National Survey on Drug Use and Health (NSDUH). This data was also used in the first study.
Main Questions
Our main questions:
- How have depression rates in American youth changed since 2002, according to the NSDUH data?
- Do mental health services appear to be reaching more youths? How have rates differed between different youth subgroups (gender, ethnicity)?
Learning Objectives
In this case study, we will determine the percentage of youths in America who have had a major depressive episode in the past year, for each year since 2002. We will compare how different youth subgroups have changed over time, by age group (12-13, 14-15, and 16-17), gender, and ethnicity. We will especially focus on using packages and functions from the Tidyverse, such as rvest. The tidyverse is a collection of packages created by RStudio. While some students may be familiar with earlier R programming packages, these packages make data science in R especially efficient.

We will begin by loading the packages that we will need:
| Package | Use |
|---|---|
| here | to easily load and save data |
| tidyverse | R packages for data science |
| rvest | to scrape web pages |
The first time we use a function, we will use the `::` operator to indicate which package it comes from. Unless we have overlapping function names, this is not necessary, but we will include it here to be informative about where the functions we use come from.
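As a minimal sketch, loading the packages and using `::` might look like the following. The `toy` data frame and its values are made up for illustration and are not NSDUH data.

```r
# Load the packages used in this case study
library(here)
library(tidyverse)
library(rvest)

# package::function() states explicitly which package a function comes from.
# Both dplyr and the base stats package provide a function called filter(),
# so the prefix removes any ambiguity about which one is meant.
toy <- tibble::tibble(year = c(2002, 2010, 2018))  # toy values, not NSDUH data
recent <- dplyr::filter(toy, year >= 2010)         # keep rows from 2010 onward
recent
```

Once a package has been attached with `library()`, the prefix can be dropped, so `filter(toy, year >= 2010)` would give the same result here.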
Context
According to other sources the rate of suicide has increased for most age groups in the United States over the past decade and a half.

While suicide does appear to be increasing among youths, it also appears to be increasing among middle-aged adults, for both females and males.

According to the CDC:
Since 2008, suicide has ranked as the 10th leading cause of death for all ages in the United States. In 2016, suicide became the second leading cause of death among those aged 10–34 and the fourth leading cause among those aged 35–54.
#### [source]
So although suicide is on the rise for most age groups, it is one of the top two causes of death for youths. This warrants further examination of the mental health of American youths.

Besides the US, other countries are also experiencing increased rates of depression in youths. See this report from the World Health Organization about rates of depression in other countries.
This paper discusses what may be causing increased depression, along with the caveats about whether depression has actually increased: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3330161/
https://www.nimh.nih.gov/health/publications/teen-depression/index.shtml
According to the National Institute of Mental Health (NIMH):
If you are in crisis and need help, call this toll-free number for the National Suicide Prevention Lifeline (NSPL), available 24 hours a day, every day: 1-800-273-TALK (8255). The service is available to everyone. The deaf and hard of hearing can contact the Lifeline via TTY at 1-800-799-4889. All calls are confidential. You can also visit the Lifeline’s website at www.suicidepreventionlifeline.org.
The Crisis Text Line is another free, confidential resource available 24 hours a day, seven days a week. Text “HOME” to 741741 and a trained crisis counselor will respond to you with support and information over text message. Visit www.crisistextline.org.
Also see here for more information about how to recognise and help youths experiencing symptoms of depression.
Limitations
There are some important considerations regarding this data analysis to keep in mind:
We treat sample estimates (estimates of the true population value) as observed values. Because this ignores the sampling uncertainty in those estimates, the p-values of any statistical tests we conduct will tend to be too small.
Furthermore, the sampling mechanism can introduce selection bias in cases where the sampling methods do not produce a representative sample.
Data is collected from human participants; this presents the potential for information bias, as participants may, for a variety of reasons, report inaccurate information.
What are the data?
We will be using data from the National Survey on Drug Use and Health (NSDUH) which is directed by the Substance Abuse and Mental Health Services Administration (SAMHSA), an agency in the U.S. Department of Health and Human Services (DHHS).
This survey started in 1971 and is conducted annually in all 50 states and the District of Columbia. Approximately 70,000 people (age 12 and up) are interviewed each year about health-related issues. Households are randomly selected, and then a professional interviewer visits each address and asks one or two of the residents to participate in an interview. The interviewer brings a laptop that the participants use to fill out the survey, which typically takes an hour to complete. Participants who choose to take part receive $30 in cash. All collected information is confidential and is used for disease surveillance and to guide public policy, particularly policy focused on drug and alcohol use as well as mental health. See here for more details about the survey.
This data is made available publicly online on the Substance Abuse & Mental Health Data Archive.

At the website for the survey data, you can see that the results are displayed in many tables. Importantly, there is no obvious way to download the data directly from this particular website.

If one clicks on the TOC button in the upper right corner, they will be directed to another website, where a large PDF document containing all of the results can be downloaded.
We are interested in investigating how depression rates have changed and how youths are interacting with mental health services. Thus, the following tables are of interest to us:
| Table | Title |
|---|---|
| Table 11.1A | Settings Where Mental Health Services Were Received in Past Year among Persons Aged 12 to 17: Numbers in Thousands, 2002-2018 |
| Table 11.1B | Settings Where Mental Health Services Were Received in Past Year among Persons Aged 12 to 17: Percentages, 2002-2018 |
| Table 11.2A | Major Depressive Episode in Past Year among Persons Aged 12 to 17, by Demographic Characteristics: Numbers in Thousands, 2004-2018 |
| Table 11.2B | Major Depressive Episode in Past Year among Persons Aged 12 to 17, by Demographic Characteristics: Percentages, 2004-2018 |
| Table 11.3A | Major Depressive Episode with Severe Impairment in Past Year among Persons Aged 12 to 17, by Demographic Characteristics: Numbers in Thousands, 2006-2018 |
| Table 11.3B | Major Depressive Episode with Severe Impairment in Past Year among Persons Aged 12 to 17, by Demographic Characteristics: Percentages, 2006-2018 |
| Table 11.4A | Receipt of Treatment for Depression in Past Year among Persons Aged 12 to 17 with Major Depressive Episode in Past Year, by Demographic Characteristics: Numbers in Thousands, 2004-2018 |
| Table 11.4B | Receipt of Treatment for Depression in Past Year among Persons Aged 12 to 17 with Major Depressive Episode in Past Year, by Demographic Characteristics: Percentages, 2004-2018 |
Data Import
Data is often made available online. Usually, the data we are interested in can be downloaded from the page as a delimited text file or an Excel file. However, sometimes data is not made available in this manner, as is the case for the NSDUH survey data.
How do we proceed in this scenario?
We could manually copy each cell of data; however, this process is inefficient, subject to error, and not reproducible. Say we wanted to run the same analysis next year on next year's data, and it happened to be formatted in the same way: we would have to repeat all of the copying by hand.
We can also use R for web scraping.
Web scraping is the process of extracting data from a website.
Basic steps of web scraping
There are two main steps to web scraping:
1. Identify location of data on the webpage that will be scraped
2. Save the webpage element to an object
We accomplish STEP 1 with our web browser.
We accomplish STEP 2 in the R programming environment.
In this case study we will scrape data from the tables on the NSDUH survey website. This data is also available in a large PDF with all the results from the year. However, it is not easy to find this PDF, and it would be difficult and time consuming to find our tables of interest and extract the data from the PDF with pdftools. Again, if we instead copied and pasted the data from the website into another file that we would then need to import, this would be less efficient, less reproducible, and might introduce errors.
Instead, we will use the rvest package to scrape the data directly from the tables on the website. Assuming next year's data is displayed in a similar manner, we could simply modify the URL in our code to run the same analysis on the new data.
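To illustrate why this is easy to rerun, here is a rough sketch in which the scraping is wrapped in a small function. The function name `scrape_table()`, its arguments, and the toy page are all hypothetical and are not part of the case study; a small local HTML file with made-up values stands in for the survey website, and `html_element()` assumes rvest 1.0.0 or later.

```r
library(rvest)

# Hypothetical helper: the name and arguments are illustrative only
scrape_table <- function(location, xpath) {
  page <- read_html(location)                # location can be a URL or a file path
  node <- html_element(page, xpath = xpath)  # first element matching the XPath
  html_table(node)                           # parse the table into a data frame
}

# Stand-in for the website: a local HTML file with made-up values
toy_page <- tempfile(fileext = ".html")
writeLines(c(
  "<html><body>",
  "<table>",
  "<tr><th>Group</th><th>2017</th><th>2018</th></tr>",
  "<tr><td>A</td><td>1.5</td><td>2.5</td></tr>",
  "<tr><td>B</td><td>3.5</td><td>4.5</td></tr>",
  "</table>",
  "</body></html>"
), toy_page)

toy_table <- scrape_table(toy_page, xpath = "//table")
toy_table
```

With a helper like this, running the analysis on a later survey year would only require changing the first argument, provided the page layout stays the same.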
The rvest package can be thought of as the pdftools package for web scraping. After pulling the data, additional wrangling will likely be required; but like the pdftools package, rvest streamlines the extraction process.
Steps for scraping tables
The two web scraping steps for these tables can be broken down even further:
- Identify location of data that will be scraped
  - right-click to inspect element (webpage)
  - hover pointer over components of element (webpage) until the data has been found
  - copy XPath of data sought
- Save webpage element to an object in R
  - import HTML code for the webpage
  - extract pieces of HTML documents (webpage) using XPath
  - parse the extracted data into a data frame
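
The R side of these steps can be sketched as follows. This is only an illustration under stated assumptions: the HTML string stands in for the survey page, the id `demo-table` and all values are made up, and the XPath is the kind of string a browser produces when copying the XPath of an element. `html_element()` assumes rvest 1.0.0 or later.

```r
library(rvest)

# Toy HTML standing in for the survey page (made-up id and values)
toy_html <- '<html><body>
  <div>
    <table id="demo-table">
      <tr><th>Age group</th><th>Percent</th></tr>
      <tr><td>12-13</td><td>1.0</td></tr>
      <tr><td>14-15</td><td>2.0</td></tr>
    </table>
  </div>
</body></html>'

# Import the HTML code for the webpage
# (read_html() also accepts a URL in place of the literal HTML string)
page <- read_html(toy_html)

# Extract the piece of the HTML document that the copied XPath points to
table_node <- html_element(page, xpath = '//*[@id="demo-table"]')

# Parse the extracted table into a data frame
demo_df <- html_table(table_node)
demo_df
```

Each step saves its result to its own object, which makes it easier to check where things went wrong if the XPath does not match anything on the page.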
Below is an animated overview of the process.
